(57) ABSTRACT

Speaker identification is performed using a single Gaussian 
mixture model (GMM) for multiple speakers, referred to
herein as a Discriminative Gaussian mixture model 
(DGMM). A likelihood sum of the single GMM is factored 
into two parts, one of which depends only on the Gaussian 
mixture model, and the other of which is a discriminative 
term. The discriminative term allows for the use of a binary 
classifier, such as a support vector machine (SVM). In one 
embodiment of the invention, a voice messaging system 
incorporates a DGMM to identify the speaker who generated 
a message, if that speaker is a member of a chosen list of 
target speakers, or to identify the speaker as a "non-target" 
otherwise. 

27 Claims, 5 Drawing Sheets 
DISCRIMINATIVE GAUSSIAN MIXTURE MODELS FOR SPEAKER VERIFICATION

CROSS-REFERENCE TO RELATED PROVISIONAL APPLICATION

This application claims the benefit of the Nov. 18, 1998, filing date of Provisional Application Serial No. 60/108,972 entitled "Discriminative Gaussian Mixture Models For Speaker Identification".

FIELD OF THE INVENTION

This invention relates generally to methods and apparatus for use in performing speaker identification.

BACKGROUND OF THE INVENTION

In systems that provide for identification of a speaker, a general technique is to score the speaker's enunciation of a test phrase against each one of a number of individual Gaussian mixture models (GMM) and to select, or identify, the speaker as that person associated with the individual GMM, or set of GMMs, achieving the best score above a certain threshold using, e.g., a maximum likelihood technique. Typically, these systems generate individual GMMs by independently training, a priori, on small (e.g., 30 millisecond (ms.)) speech samples of training phrases spoken by the respective person.

Unfortunately, such systems do not perform well when attempting to discriminate the true speaker from people that merely sound like the true speaker. As such, in an attempt to improve discrimination these systems increase the number of GMMs to include "cohort" or "background" models, i.e., people that sound like the true speaker but are not (e.g., see Herbert Gish and Michael Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, pages 18-32, 1994).

Alternatively, for both the speech and speaker recognition problems, a different approach has recently been proposed which uses a discriminative cost function (which measures the empirical risk) during training in place of the maximum likelihood estimation, giving significantly improved generalization performance (e.g., see, Biing-Hwang Juang, Wu Chou, and Chin-Hui Lee, "Minimum Classification Error Rate Methods for Speech Recognition," IEEE Transactions on Speech and Audio Processing, 5(3):257-265, 1997; and Chi-Shi Lui, Chin-Hui Lee, Wu Chou, Biing-Hwang Juang, and Aaron E. Rosenberg, "A study on minimum error discriminative training for speaker recognition," Journal of the Acoustical Society of America, 97(1):637-648, 1995). However, here the underlying model (a set of hidden Markov models) is left unchanged, and in the speaker recognition case, only the small vocabulary case of isolated digits was considered.

In providing speaker identification systems such as described above, support vector machines (SVMs) have been used for the speaker identification task directly, by training one-versus-rest and one-versus-another classifiers on the preprocessed data (e.g., see M. Schmidt, "Identifying speaker with support vector networks," Interface '96 Proceedings, Sydney, 1996). However, in such SVM-based speaker identification systems, training and testing are both orders of magnitude slower than, and the resulting performance is similar to, that of competing systems (e.g., see also, National Institute for Standards and Technology, Speaker recognition workshop. Technical Report, Maritime Institute of Technology, Mar. 27-28, 1996).

SUMMARY OF THE INVENTION

Unfortunately, the above-described approaches to speaker identification are not inherently discriminative, in that a given speaker's model(s) are trained only on that speaker's data, and effective discrimination relies to a large extent on finding effective score normalization and thresholding techniques. Therefore, I have developed an alternative approach that adds explicit discrimination to the GMM method. In particular, and in accordance with the invention, I have developed a way to perform speaker identification that uses a single Gaussian mixture model (GMM) for multiple speakers, referred to herein as a Discriminative Gaussian mixture model (DGMM).

In an illustrative embodiment of the invention, a DGMM comprises a single GMM that is used for all speakers. A likelihood sum of the GMM is factored into two parts, one of which depends only on the Gaussian mixture model, and the other of which is a discriminative term. The discriminative term allows for the use of a binary classifier, such as a support vector machine (SVM).

In another embodiment of the invention, a voice messaging system incorporates a DGMM. The voice messaging system comprises a private branch exchange (PBX) and a plurality of user terminals, e.g., telephones, personal computers, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustrative flow chart embodying the principles of the invention; and

FIGS. 2-5 show illustrative block diagrams of DGMM voice mail systems embodying the principles of the invention.

DETAILED DESCRIPTION

Before describing an illustrative embodiment of the invention, a brief background is provided on the above-mentioned prior art speaker identification approach that uses individual, non-discriminative, GMMs. In the following description, the phrase "target speaker" means a speaker whose identity the respective system is supposed to determine (note, there may be non-target speakers in the training and test sets). Other than the inventive concept, it is assumed that the reader is familiar with mathematical notation used to generally represent kernel-based methods as known in the art. Components of vectors and matrices are labeled with Greek indices, vectors and matrices themselves are labeled with Roman indices.

Gaussian Mixture Models: The Standard Approach

In a typical GMM system, one (or several) GMMs are built for each speaker. The data is preprocessed into low dimensional feature vectors, typically using 20 ms frame sizes and 10 ms steps. The features used are often some or all of the following (as known in the art): vectors of cepstral, delta-cepstral, and delta-delta-cepstral coefficients, a scalar measure of energy, and a scalar measure of pitch.

A Gaussian mixture model is simply a weighted sum of Gaussian densities, where the positive (scalar) weights sum to unity (which results in the sum itself being a density). It is desired to model the posterior conditional probability P(S_i|x_1, ..., x_m), where S_i labels one of N_S target speakers (i=1, ..., N_S), and x_k ∈ R^n is one of m feature vectors, each of which is derived from a different portion of the speech signal. Bayes' rule gives:
$$P(S_i \mid x) = \frac{p(x \mid S_i)\, P(S_i)}{p(x)} \tag{1}$$

where here and below, P denotes probabilities, p the corresponding densities, and x is shorthand for the set of feature vectors x_1, ..., x_m. Thus in order to find the target speaker who most likely generated a given set of test feature vectors x, the posterior probability P(S_i|x) is maximized over the choice of speaker S_i. If it is assumed that all speakers have equal priors (P(S_i)=constant) then clearly this amounts to finding the maximum likelihood p(x|S_i) over all target speakers. This approach is sufficient for the "closed set" problem, where the test speech is guaranteed to have originated from a target speaker. In the harder, "open set" problem, the speech may or may not have been generated by a target speaker, and some thresholding algorithm is necessary to decide whether to reject or accept the message (i.e., whether to assert that it was generated by one of the target speakers). In the description that follows, the open set problem is considered and the assumption is made that the random variables x_1, ..., x_m are independent, so that the density p(x|S_i) can be expanded as a product. The likelihood is then modeled as a Gaussian mixture:

$$p(x_k \mid S_i) = \sum_{j=1}^{N_C} p(x_k \mid S_i, C_j)\, P(C_j \mid S_i) \tag{2}$$

$$p(x_k \mid S_i, C_j) = (2\pi)^{-n/2}\, |\Sigma_{ij}|^{-1/2} \exp\!\left(-\tfrac{1}{2}\,(x_k - \mu_{ij})^{T}\, \Sigma_{ij}^{-1}\, (x_k - \mu_{ij})\right) \tag{3}$$

where the sum in equation (2) is over a set of mutually exclusive and complete events ("clusters") C_j and where μ_ij is the mean for the Gaussian distribution for speaker i's j'th cluster, Σ_ij is the corresponding covariance matrix (recall that i, j here label the vectors and matrices themselves, not their components), and |Σ_ij| its determinant (e.g., see A. A. Sveshnikov, Problems in probability theory, mathematical statistics and theory of random functions, Dover Publications, New York, 1978).

To train a GMM for a given speaker S_i, one starts by specifying how many Gaussians to use (N_C). This is typically anywhere from 20 to 150. Some in the art choose 50 to roughly match the number of phonemes (e.g., see article by Gish and Schmidt, cited above). Then the training feature vectors are clustered into N_C clusters, using, for example, the K means algorithm (e.g., see V. Fontaine, H. Leich, and J. Hennebert, "Influence of vector quantization on isolated word recognition," in M. J. J. Holt, C. F. N. Cowan, P. M. Grant, and W. A. Sandham, editors, Signal Processing VII, Theories and Applications; Proceedings of EUSIPCO-94, Seventh European Signal Processing Conference, volume 1, pages 116-18, Lausanne, Switzerland, 1994, Eur. Assoc. Signal Process). The resulting means and covariances form the starting point for GMM training, for which a simple variant of the EM algorithm is used (e.g., see, A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society B, 39(1):1-22, 1977; and the article by Fontaine et al., cited above). In particular, for each cluster, and given the cluster memberships output by the K means algorithm, the mean and covariance matrices are recomputed:

$$\mu_i = \frac{1}{N_i} \sum_{a=1}^{N_i} x_a \tag{4}$$

$$\Sigma_i = \frac{1}{N_i} \sum_{a=1}^{N_i} (x_a - \mu_i)(x_a - \mu_i)^{T} \tag{5}$$



In equations (4, 5), the x_a are those feature vectors which are members of cluster C_i, where cluster membership is determined by maximum likelihood. Then the cluster memberships are recomputed using likelihoods computed using the new means and covariances. The maximum likelihood cluster for vector x is thus given by

$$C_{\max}(x) = \arg\max_{C_i} P(C_i \mid x) = \arg\max_{C_i} \frac{p(x \mid C_i)\, P(C_i)}{p(x)} = \arg\max_{C_i}\, p(x \mid C_i)\, P(C_i) \tag{6}$$

It should be noted that this is really a maximum posterior computation, since the priors P(C_i) (which are estimated by counting cluster memberships) are used. This two-step process is then iterated until convergence (i.e. until cluster membership stabilizes).
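
The two-step iteration of equations (4), (5) and (6) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes diagonal covariance matrices (the text uses full ones), takes the initial cluster labels as given instead of running K means, and the names `reestimate`, `assign`, `train_hard_em` and the `var_floor` parameter are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log Gaussian density with diagonal covariance (simplified equation (3))."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def reestimate(vectors, labels, n_clusters, var_floor=1e-3):
    """Equations (4), (5): per-cluster mean and variance; priors by counting."""
    dim = len(vectors[0])
    params = []
    for c in range(n_clusters):
        members = [x for x, lab in zip(vectors, labels) if lab == c]
        if not members:
            params.append(None)  # assumption: an empty cluster is dropped
            continue
        n = len(members)
        mean = [sum(x[d] for x in members) / n for d in range(dim)]
        # var_floor (hypothetical) keeps tiny clusters numerically usable
        var = [max(sum((x[d] - mean[d]) ** 2 for x in members) / n, var_floor)
               for d in range(dim)]
        params.append((mean, var, n / len(vectors)))
    return params

def assign(x, params):
    """Equation (6): the cluster maximizing p(x|C) P(C), computed in logs."""
    best, best_score = None, -math.inf
    for c, p in enumerate(params):
        if p is None:
            continue
        mean, var, prior = p
        score = log_gauss_diag(x, mean, var) + math.log(prior)
        if score > best_score:
            best, best_score = c, score
    return best

def train_hard_em(vectors, init_labels, n_clusters, max_iter=100):
    """Alternate re-estimation and reassignment until memberships stabilize."""
    labels = list(init_labels)
    for _ in range(max_iter):
        params = reestimate(vectors, labels, n_clusters)
        new_labels = [assign(x, params) for x in vectors]
        if new_labels == labels:
            break
        labels = new_labels
    return reestimate(vectors, labels, n_clusters), labels

# Usage: two well-separated groups, with two vectors deliberately mislabeled.
vectors = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2],
           [10.0, 10.1], [10.2, 9.9], [9.9, 10.0], [10.1, 10.2]]
params, labels = train_hard_em(vectors, [0, 0, 0, 1, 1, 1, 1, 0], 2)
```

In this toy run the two mislabeled vectors move to their natural clusters after one reassignment, and the loop stops once the labels no longer change.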

In test phase, one computes the sum log likelihood

$$\log \prod_{k=1}^{m} p(x_k \mid S_i) = \sum_{k=1}^{m} \log p(x_k \mid S_i) = \sum_{k=1}^{m} \log \sum_{j=1}^{N_C} p(x_k \mid S_i, C_j)\, P(C_j \mid S_i) \tag{7}$$

that a given test message (from which m feature vectors were extracted) was generated by a given speaker S_i's GMM, and combined with a suitable normalization and thresholding scheme, either rejects the message (i.e., asserts that the message was not generated by a target speaker), or identifies the target and outputs a confidence value. Here p(x_k|S_i,C_j) is given by equation (3), and P(C_j|S_i) is estimated by counting cluster memberships for speaker i's training data.
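
As a rough illustration of equation (7), the sketch below scores a test message against one GMM per speaker and picks the best-scoring speaker. It assumes diagonal covariances, represents each GMM as a list of (weight, mean, variance) triples, and uses a plain threshold on the mean per-vector log likelihood as a stand-in for the normalization and thresholding scheme, which the text does not specify. The names `identify` and `reject_below` are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log Gaussian density with diagonal covariance (simplified equation (3))."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_mixture_density(x, gmm):
    """log p(x|S_i): the inner sum of equation (7), via log-sum-exp."""
    terms = [math.log(w) + log_gauss_diag(x, mean, var) for w, mean, var in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def message_log_likelihood(vectors, gmm):
    """Equation (7): sum of per-vector log likelihoods (independence assumed)."""
    return sum(log_mixture_density(x, gmm) for x in vectors)

def identify(vectors, speaker_gmms, reject_below):
    """Return (speaker or None, scores); None means the message is rejected.

    reject_below is a hypothetical threshold on the mean per-vector
    log likelihood, used here only to illustrate the open set decision.
    """
    scores = {name: message_log_likelihood(vectors, gmm)
              for name, gmm in speaker_gmms.items()}
    best = max(scores, key=scores.get)
    if scores[best] / len(vectors) < reject_below:
        return None, scores
    return best, scores

# Usage with two toy one-dimensional speaker models.
gmms = {
    "alice": [(0.5, [0.0], [1.0]), (0.5, [1.0], [1.0])],
    "bob": [(1.0, [5.0], [1.0])],
}
message = [[0.2], [0.8], [0.5]]
speaker, scores = identify(message, gmms, reject_below=-10.0)
```

Working in the log domain avoids underflow when m is large, since the product of m small densities quickly falls below floating-point range.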

An intriguing property of the Gaussian distribution is that the dependence of the above sum on the test data can be described solely in terms of the mean and covariance of the set of test feature vectors (e.g., see above-cited Gish et al. article). Furthermore, the resulting expression can be derived by asking for the likelihoods of two characteristic statistics of the test data, namely its mean and covariance matrix. These observations lead to the idea of using the likelihoods of test mean and covariance as contributions to a weighted sum of likelihoods, resulting in the "modified Gaussian models" as described in the above-cited article of Gish et al. One can then easily incorporate the likelihood for the delta-cepstral covariance matrix into the sum (the delta-cepstral mean contains no new information).
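
The sufficiency of the test mean and covariance can be checked numerically for a single Gaussian. The sketch below assumes the scalar (one-dimensional) case and uses hypothetical function names. It relies on the identity sum_k (x_k - mu)^2 = m*s^2 + m*(xbar - mu)^2, where xbar and s^2 are the sample mean and (biased) sample variance of the m test values.

```python
import math

def sum_log_gauss(xs, mu, var):
    """Direct sum of log N(x_k; mu, var) over all test values."""
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
               for x in xs)

def sample_stats(xs):
    """Count, mean and biased variance of the test values."""
    m = len(xs)
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return m, mean, var

def sum_log_gauss_from_stats(m, mean, s2, mu, var):
    """Same sum, computed only from the count, sample mean and sample variance."""
    return -0.5 * m * (math.log(2.0 * math.pi * var) + (s2 + (mean - mu) ** 2) / var)

# Usage: both routes give the same value up to rounding error.
xs = [0.3, -1.2, 2.5, 0.9]
direct = sum_log_gauss(xs, mu=0.5, var=2.0)
via_stats = sum_log_gauss_from_stats(*sample_stats(xs), mu=0.5, var=2.0)
```

The same decomposition holds in n dimensions with the trace of the inverse model covariance times the sample covariance in place of s^2/var. It applies to each Gaussian separately, not to the logarithm of a mixture sum.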

The key remaining problem is how to normalize the scores from different GMMs so that they can be compared, in such a way that good discrimination results. The above-cited article by Gish et al. suggests a number of possibilities. First, if one has training messages corresponding to
different sessions for a given speaker, at least one model is built for each, in the hope of capturing channel or session variations. One may further split a given training message into pieces and build a GMM for each piece, in order to model variation due to, for example, the occurrence of multiple speakers in the same message. A test message is also segmented, and each test message is attached to whichever model gives it the highest likelihood. The score is then normalized by dividing that likelihood by the highest occurring likelihood for that message for all models for all other speakers. Often, many "cohort" speakers (a target speaker's cohorts are a set of other speakers which are attributed high likelihood scores by that speaker's model) are added to the training set to help in the score normalization and discrimination. Even after normalization, there are many possible approaches to how to combine scores from different segments. Finally, given such an approach it is useful also to estimate confidence values for the resulting score (e.g., see the above-cited article by Gish et al.).

Discriminative Gaussian Mixture Models

Unfortunately, the above approach is not inherently discriminative, in that a given speaker's model(s) are trained only on that speaker's data, and effective discrimination relies to a large extent on finding effective score normalization and thresholding techniques. Therefore, I have developed an alternative approach that adds explicit discrimination to the GMM method. In particular, and in accordance with the invention, I have developed a way to perform speaker identification that uses a single Gaussian mixture model (GMM) for multiple speakers, referred to herein as a Discriminative Gaussian mixture model (DGMM). Other than the inventive concept, the techniques described below encompass techniques found in standard speech preprocessing for use in speech recognition as well as speaker recognition.

The DGMM approach is based on two key ideas. First, create a single GMM for all target speakers. Thus this GMM is intended to model all speech generated by these speakers. (The building of a GMM to model all speech can also be explored.) Second, the basic GMM model is extended to directly include discrimination. For this, Support Vector machines (SVMs) are used (although any binary classifiers could be used for the discrimination subsystem).

Consider a single feature vector x. The desired posterior probability is expanded as:

$$P(S_i \mid x) = \sum_{j} P(S_i \mid x, C_j)\, P(C_j \mid x) \tag{8}$$

Although the expansion is formally similar to that of equation (2) (both are just expressions of the formula for total probability, and require only that the events C_j be mutually exclusive and complete), the meaning is rather different. In equation (2), the C_j are simply a partition of the density for a given speaker, and are only required to account for data generated by that speaker; the underlying assumption is that a particular observed vector x is generated by one and only one of the C_j. In equation (8), the underlying assumption is that a given x is still generated by one and only one of the C_j (mutual exclusivity) and the probabilities P(C_j) still sum to one (completeness), but the C_j are no longer associated with only a single speaker. However, one may still think of them as clusters, in the sense that a given vector x can be attached to a single cluster by maximum likelihood (barring the unlikely event that x has the same maximum likelihood in more than one cluster).

At this point, the two terms appearing in the sum in equation (8) are examined. The discrimination will be built into the first term, which is modeled as:

$$P(S_i \mid x, C_j) \approx P(S_i \mid y_{ij}(x), C_j) \tag{9}$$

where y_ij(x) is the output of the SVM, given x as an input, where the SVM has been trained to distinguish speaker i from all other speakers, so that a more positive y_ij indicates more evidence in favor of speaker i. Equation (9) is an approximation: given some feature vector x associated with cluster C_j, the probability that it was generated by speaker i is modeled using a binary classifier, trained to distinguish feature vectors from speaker i (in cluster j) from feature vectors from all other speakers in cluster j. The assumption is that the information about the identity of speaker i that is encapsulated in the feature vector x can be extracted and represented by a real-valued function y(x). However this is exactly what a real-valued binary classifier is supposed to do. Clearly in the case where the data x are separable, in such a way that the y_ij are in 1-1 correspondence with the speakers, the approximation becomes exact (the above probabilities become 0 or 1). At the other extreme, when a given vector x is equally likely to have been generated by any one of the speakers, the y_ij are also not likely to be significantly more positive for any particular speaker i than for any other speaker, leading to equal P(S_i|y_ij(x),C_j). In the intermediate case, where the densities overlap, but where discrimination is still possible, equation (9) essentially uses the classifier alone as a method for function approximation, where the function being approximated is the posterior probability.

Again, using Bayes' rule, the following can be written:

$$P(S_i \mid y_{ij}(x), C_j) = \frac{p(y_{ij}(x) \mid S_i, C_j)\, P(S_i \mid C_j)}{p(y_{ij} \mid C_j)} \tag{10}$$

Since the y are real-valued outputs of a binary classifier, the denominator can be written as the sum of two densities, one corresponding to outputs from the target speaker S_i, and one corresponding to outputs from all non-target speakers, denoted S̄_i:

$$p(y_{ij} \mid C_j) = p(y_{ij} \mid S_i, C_j)\, P(S_i \mid C_j) + p(y_{ij} \mid \bar{S}_i, C_j)\, P(\bar{S}_i \mid C_j) \tag{11}$$

Thus, the desired posterior probability becomes (the role of γ is discussed further below):

$$P(S_i \mid x) = \sum_{j} A(x, S_i, C_j)\, B(x, C_j) \tag{12}$$

$$A(x, S_i, C_j) = \frac{p(y_{ij}(x) \mid S_i, C_j)\, P(S_i \mid C_j)}{p(y_{ij} \mid S_i, C_j)\, P(S_i \mid C_j) + p(y_{ij} \mid \bar{S}_i, C_j)\, P(\bar{S}_i \mid C_j)} \tag{13}$$

$$B(x, C_j) = \frac{p(x \mid C_j)\, P(C_j)}{\sum_{k} p(x \mid C_k)\, P(C_k) + \gamma} \tag{14}$$

Note that p(y_ij(x)|S_i,C_j) and p(y_ij(x)|S̄_i,C_j) are easily modeled using the outputs of the classifier, since they are one dimensional densities which are straightforward to estimate using a validation set, in contrast to the modeling of multidimensional densities. The P(S_i|C_j) and P(C_j) can be estimated by counting cluster memberships. Note that if the training set is dominated by a small subset of speakers, P(S_i|C_j) and P(S_i) will vary significantly, and in some applications this may be useful prior knowledge. However, if it is known that the application is one in which P(S_i) will vary significantly over time scales longer than that over which the training data was collected, in the absence of a
model for such variation, and assuming that the system is not constantly retrained, it is safest to adopt the "uninformative priors" position and assume that all P(S_i) are equal.

As noted above, the inventive concept comprises the notion of using a single GMM, and performing discrimination within that GMM. Two examples are described below. In the first, the "target-speech" model, only speech from target speakers is used to create the single GMM. In the second, the "all-speech" model, a single GMM is trained using a large pool of speakers (possibly including the target speakers, possibly not), with the intent that the resulting GMM is a representative model for all speech from which targets are likely to be drawn. (Here, the idea is to train a GMM once using a very large amount of data (which is referred to herein as pool data, below), and then to use this as a fixed GMM for the speaker ID task.) Note that in both methods, during the classifier training feature vectors from non-target speakers can be used to supplement the set of counterexamples. A key property shared by both approaches is that the method "factors out" much of the variation in the signal before it is presented to the classifiers: thus to the extent that the single GMM models phonemes, a given classifier, such as an SVM, is trained to distinguish a particular phoneme for one speaker from that of other speakers.

Usually, when using SVMs for multiclass problems, the decision is made based on heuristic measures, such as the distance from the separating hyperplane (e.g., see C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, 20:273-297, 1995; and C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 2(2):121-167, 1998). The methods described here have the advantage that the SVM outputs are combined in a principled way with the Gaussian density models. For the case when the classifier outputs are very noisy, as they are here, this is especially advantageous over the approach of directly using SVM classifiers on the preprocessed data.

The target-speech model combines advantages of both classifier and GMM approaches. In clusters where speaker discrimination is hard, the two distributions for the y_ij (for positive and negative examples) in equation (12) will overlap significantly, and the terms will be dominated by the priors P(S_i|C_j), P(C_j) and the likelihoods p(x|C_j), reducing to a GMM-like solution; in clusters where speaker discrimination is easier, the term A(x, S_i, C_j) in equation (13) will vary significantly, taking very small values for negative examples, and becoming O(1) for positive examples, and the method approaches that of classical pattern recognition. However, it is worth considering what the expansion represented by equation (8) means in this case. As in the standard technique, the clusters are found using only target speech, and the hope is that non-target speakers will get low likelihoods. In fact much of the work around GMMs centers on how to add discrimination on top of the basic GMM approach in order to compensate for the GMM's tendency to give high likelihoods to non-target speakers (e.g., see the above-cited article by Gish et al.). In the DGMM case, the basic model has two weapons for rejecting non-target speakers: low cluster likelihood, P(C_j|x), and low classifier outputs, P(S_i|y_ij(x),C_j). Since the DGMM here models only target speech, the role of the term γ in equation (14) is to allow non-target speech to result in low likelihoods. The term γ can be viewed as a constant approximation to a sum over a set of other clusters, corresponding to data generated by non-target speakers.

The all-speech model method may be viewed as relying on the binary classifiers for discrimination, and using the Gaussian mixtures as an additional preprocessing step. The model has the advantage that it need not be recomputed given a new set of target speakers. (It should be noted that one might consider incremental training schemes, in which the GMM is further trained with the target speakers added to the pool.) Furthermore, a set of highest-likelihood counterexamples for classifier training can be computed once, and added to those computed from any speakers whose data became available after the GMM was trained. As before, the p(x|C_j) are modeled as Gaussian densities, and P(C_j) are estimated using cluster membership counts. In computing the posterior probability P(S_i|x) in equation (12), P(C_j) in equation (14) can be computed from all the data, and P(S_i|C_j) in equation (13) from cluster membership counts using only the target data. Since the DGMM here models all speech from which target speech will be drawn, the role of the term γ in equation (14) is to give low likelihood to a non-speech signal, for example one that is dominated by channel noise. The term γ can thus be viewed as a constant approximation to a sum over a set of other clusters, corresponding to non-speech data.

The description above was for the case of a single feature vector x. However, the inventive concept can be extended to handle multiple feature vectors (from the same speaker). Again, it is desired to find that speaker S_i for which P(S_i|x) is maximized. Again assuming that the x_k are independent:

$$P(S_i \mid x) = \frac{p(x \mid S_i)\, P(S_i)}{p(x)} = P(S_i) \prod_{k=1}^{m} \frac{p(x_k \mid S_i)}{p(x_k)} = P(S_i) \prod_{k=1}^{m} \frac{P(S_i \mid x_k)}{P(S_i)} \tag{15}$$

If all priors P(S_i) are equal then maximizing P(S_i|x) over all S_i is equivalent to maximizing Σ_{k=1}^{m} log P(S_i|x_k). Otherwise, one must find the S_i that maximizes:

$$\log P(S_i) + \sum_{k=1}^{m} \log \frac{P(S_i \mid x_k)}{P(S_i)} \tag{16}$$

Training a DGMM thus proceeds in two steps: training the GMM, as described above (e.g., see the equations (4), (5) and (6) and respective text); and then using the resulting GMM to cluster the data to generate SVM training data. For a given cluster, and a given SVM (there are N_S SVMs per cluster), the positive examples are generated in order of likelihood. Suppose speaker S's vectors are ordered by likelihood for a given cluster C. Furthermore, suppose n_SC is the number of speaker S's vectors that fall into cluster C when that speaker's vectors are assigned to all clusters via maximum likelihood. The number of positive examples for the corresponding SVM is then taken to be max(n_SC, n_min), where n_min is a user-chosen minimum number of positive examples required to train an SVM (e.g., assume that n_min=250). If it is assumed that the number of negative examples was fixed at 2000, the examples are again found by ordering the vectors for all speakers other than speaker S by likelihood for cluster C. Finally, after a given set of vectors have been set aside to train a given SVM, the next 100 negative and 100 positive examples (again ordered by likelihood) are used as a validation set to estimate the density in equation (13).

Note that this method is very easily extended to handle the equivalent of cohort speakers. One can simply add the
(non-target) data to that on which the GMM was trained, before computing the SVM training data. Any data thus added that gets high enough likelihood scores will appear as negative examples for the SVMs. Finally, training speed and generalization performance can be improved by Mahalanobis renormalizing the data, according to:

$$x' = L^{-1}(x - \mu) \tag{17}$$

prior to SVM training and testing. Here L is the lower triangular matrix used in the Cholesky decomposition of Σ = L L^T. Note that this renormalization is not unique, in that there are other transformations that also result in x'^T x' = (x−μ)^T Σ^{-1} (x−μ), but in our case this does not matter, since only dot products of the data appear in both SVM training and testing for the kernels used. Loosely speaking, this renormalization is removing variation from the data which is already accounted for by the GMM, thus easing the learning task faced by the SVM; note that the renormalized training data has zero mean and unit covariance matrix.

With respect to the test phase of a DGMM, clearly if all terms in the sum in equation (8) are allowed to appear, the test phase will be slow (e.g., for 20 speakers and 50 clusters, 1000 SVMs would have to be evaluated), and worse, SVMs will be applied to data for which they were not trained, in the sense that SVMs for all clusters will be applied to data whose likelihood exceeds a given threshold in only a few clusters. Recalling that a given SVM is associated with a particular cluster and a particular speaker (in the sense that its training data is generated via likelihood computations for that cluster, and the positive/negative labels for the data are determined by that speaker's identity), there are two possible remedies: (1) cluster the data via maximum likelihood, and then use the corresponding SVMs for each cluster, or (2) only use an SVM on a given vector if its likelihood (P(C_j|x) in equation (8)) exceeds some user-chosen threshold. The latter method has been experimentally found to give better results, which are described below. Note that both methods require a means of handling vectors which fall below threshold. If one simply does not include them in the sum, then one is discarding evidence that this speaker is not a target speaker. On the other hand, if they are included by attaching a fixed penalty score for P(S_i|x,C_j) in equation (8), then the resulting contribution to the sum can swamp the contribution from the above-threshold vectors. One simple heuristic solution to this is to penalize the mean log likelihood resulting from just the above-threshold vectors by an amount which depends on the fraction of vectors which fell below threshold. However, as described herein, only the simplest approach, that of discarding the below-threshold vectors, is considered.

The above approach requires that one estimate the posterior probabilities P(S_i|y_ij(x),C_j). As explained above, one method to do this is to express this quantity as a ratio of densities as in equation (13). This has the advantage of giving a principled way to incorporate the priors P(S_i|C_j), but it requires that one estimate the one-dimensional densities of the support vector outputs. The solution to this is described below.

Denote the density of the input data by p(x). A support vector machine may be viewed as a real valued function y(x) = Σ_i α_i z_i K(x, x_i) + b, where the α_i are positive Lagrange multipliers found by the training procedure, x_i are the training points and z_i ∈ {±1} their polarities, and b is a threshold, also determined by training (for more details, see

sum over the kernel functions is in fact an inner product in some Hilbert space H between the mapped point Φ(x) ∈ H and some vector w ∈ H (the mapping Φ, and the space H, are both determined by the choice of kernel K). (In fact, K only determines the value of the inner product of two mapped vectors, according to K(x_1, x_2) = Φ(x_1)·Φ(x_2). For a given choice of K, in general neither the space H nor the mapping Φ are uniquely determined.) The density of the random variable y in terms of p(x) and K may therefore be written

$$p(y) = \int \delta\big(y - w \cdot \Phi(x) - b\big)\, p(x)\, dx = \int \delta\Big(y - \sum_{i} \alpha_i z_i K(x, x_i) - b\Big)\, p(x)\, dx \tag{18}$$

In general, exact computation of this density is not feasible, since p(x) is unknown. Even if p(x) were known, an exact expression may be elusive: one can view the kernel mapping as a mapping of the data to an n-surface in H (recall that x ∈ R^n), and this surface will in general have intrinsic curvature (e.g., see C. J. C. Burges, "Geometry and invariance in kernel based methods" in B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, "Advances in Kernel Methods: Support Vector Learning," MIT Press, 1998), so when one projects onto a given direction, p(y) may have contributions from quite different parts of the mapped surface, making evaluation of equation (18) difficult. In the special case when the mapped data is normally distributed in H, it is straightforward to show that y must then itself be a one dimensional Gaussian. Therefore, for the purposes of this paper it is assumed that y has Gaussian distribution.

At this point, attention is directed to the figures, which illustrate some applications of the inventive concept. Other than the inventive concept, the elements shown in the figures are well-known and will not be described in detail. An illustrative flow chart embodying the principles of the invention is shown in FIG. 1 for use in a voice mail system. In step 50, a DGMM is trained on a set of target speakers (e.g., see the equations (4), (5) and (6) and respective text). As noted earlier, training involves the labeling and collection of sound clips of various people. For example, as a user retrieves voice mail messages, the user labels each message as being associated with a particular person. In step 60, testing is performed on newly received voice mail. In particular, as each newly received voice mail is stored, the corresponding sound clips are tested against the training data using the DGMM. Test results include an identification of a particular person (whose data was used in the above-mentioned training phase) and a confidence level as to the accuracy of the decision. (Although not described, the labeling process involves adding a "label" field to the storage parameters for a voice mail message. This label field is associated, e.g., via a separately stored data base (not shown) with particular individuals. Further examples of labeling and testing are described below in the context of some illustrative systems embodying the principles of the invention.)

An illustrative DGMM system 100 in accordance with the principles of the invention is shown in FIG. 2. DGMM 100 comprises personal computer (PC) 105, local area network (LAN) 101, private branch exchange (PBX) 115, which includes voice mail server (VMS) 110. It is assumed that PC 105 represents one of a number of client-type machines

for example the above-mentioned articles by Cortes et al.; 65 coupled to LAN 101. VMS 110, of PBX 115, is also coupled 

and Burges; and also V. Vapnik, "The Nature of Statistical to LAN 101. PBX 115 receives communications via facility 

Learning Theory," Springer- Verlag, New York, 1995). The 116, which represents any one of a number of voice- 
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switched lines, e.g., analog, Tl, etc. Illustratively, PBX 115 
receives telephone calls via facility 116 and, when 
necessary, provides voice mail functionality for the creation 
and storage of voice mail messages. For the purposes of this 
description, it is assumed that VMS 110 performs the 5 
above-described steps shown in FIG, 1. 
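
As a rough illustration of the testing step assumed here for VMS 110, the following Python sketch scores one candidate speaker using the simplest approach described earlier, in which below-threshold vectors are discarded. It is a sketch under stated assumptions, not the patented implementation: the name `mean_log_score`, the list-of-lists layout, and the per-vector combination (summing the product of the speaker posterior and the cluster posterior over above-threshold clusters) are illustrative choices, because equation (8) is not reproduced in this passage.

```python
import math


def mean_log_score(cluster_post, speaker_post, threshold):
    """Score one candidate speaker over a sequence of feature vectors.

    cluster_post[t][j] -- assumed estimate of P(C_j | x_t) from the GMM
    speaker_post[t][j] -- assumed estimate of P(S | x_t, C_j) from the
                          classifier attached to cluster j
    threshold          -- user-chosen cutoff on the cluster posterior

    Returns (mean log score over the vectors that were used,
             fraction of vectors discarded as below threshold).
    The combination rule is an assumption made for illustration.
    """
    logs = []
    for c_row, s_row in zip(cluster_post, speaker_post):
        # Only clusters whose posterior exceeds the threshold contribute.
        terms = [s * c for c, s in zip(c_row, s_row) if c > threshold]
        if not terms:
            continue  # below-threshold vector: discarded
        # Floor the sum so log() is defined even for a zero posterior.
        logs.append(math.log(max(sum(terms), 1e-300)))
    n = len(cluster_post)
    if not logs:
        return None, 1.0
    return sum(logs) / len(logs), (n - len(logs)) / n
```

The discarded fraction is returned alongside the score because the penalty heuristic mentioned earlier would need it; the form of that penalty is not specified in the text, so it is left to the caller.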

As is known in the art, a user associated with PC 105 receives
notification of newly received voice mail via LAN 101 by a respective
notification message from VMS 110. For example, the notification
message can be a "pop-up" icon, email message, etc. In accordance
with the principles of the invention, each notification message
comprises at least the following information: an identification field
and a confidence level. In particular, VMS 110 tests each newly
received voice mail message using a DGMM. If VMS 110 identifies a
potential sender of the message, the identification field identifies
the person and the confidence level provides an estimate as to the
accuracy of the decision, or selection, of that person. If VMS 110
does not identify a potential sender, the identification field
provides a suitable message, e.g., "not a target speaker."
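
The two fields of the notification message can be pictured with a small Python sketch. The class name `Notification`, the `message_id` field, and the helper `build_notification` are hypothetical names introduced only for illustration; the text specifies only that an identification field and a confidence level are present.

```python
from dataclasses import dataclass
from typing import Optional

NON_TARGET = "not a target speaker"


@dataclass
class Notification:
    message_id: str      # hypothetical key for the stored voice mail
    identification: str  # label of the identified person, or NON_TARGET
    confidence: float    # estimate of the accuracy of the decision


def build_notification(message_id: str, speaker: Optional[str],
                       confidence: float) -> Notification:
    """Fill the identification field; speaker=None means no sender found."""
    label = speaker if speaker is not None else NON_TARGET
    return Notification(message_id, label, confidence)
```

A client could render such a record as a pop-up or email body; how the confidence value is computed is outside the scope of this sketch.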

In accordance with the inventive concept, the user associated with PC
105 can gradually build up a collection of sound clips for use in
identifying callers. For example, initially, each notification
message sent by VMS 110 to the user will indicate the message "not a
target speaker." However, the user then "labels" the associated voice
mail by replying, i.e., sending a responsive message, back to VMS
110. The responsive message indicates the associated voice mail
message and includes a name of a person. Upon receipt of the
responsive message, VMS 110 creates a label for this newly identified
person. Once sufficient sound clips have been collected for the newly
identified person, VMS 110 begins training. (It should be noted that
other techniques can be used for identifying and labeling sound clips
for training. For example, an application program, upon execution in
a client, recovers current stored voice mail message headers (e.g., a
speech-to-text transcription of calling party and/or calling number,
etc.) from VMS 110 for display in one column of a two-column list
form on a screen (not shown) of PC 105. The other column of the list
allows entry by the user of a name of an associated person, i.e., a
label for the voice mail. A "submit" button causes the list
information to be sent back to VMS 110, which then performs training
on the identified sound clips. Other than the idea of using such an
application program in conjunction with the inventive concept, such
application programs use standard programming techniques and, as
such, will not be described herein. NetScape® is illustrative of an
application suite that will serve as such a platform.) Subsequently
received voice mail messages are then tested against the trained
data.
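
The label-then-train workflow can be sketched in Python as follows. The class `LabelStore`, its method `add_label`, and the `min_clips` parameter are hypothetical; the text says only that training begins once "sufficient" sound clips have been collected, without giving a number.

```python
from collections import defaultdict


class LabelStore:
    """Tracks which stored voice mails the user has labeled with which name."""

    def __init__(self, min_clips=3):
        # min_clips is an assumed stand-in for "sufficient sound clips".
        self.min_clips = min_clips
        self.clips = defaultdict(list)

    def add_label(self, message_id, name):
        """Record a responsive message; return True when training may begin."""
        clips = self.clips[name]
        if message_id not in clips:  # ignore a repeated label for one message
            clips.append(message_id)
        return len(clips) >= self.min_clips
```

In this sketch the server would call `add_label` for each responsive message and start training for a person the first time the call returns True.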

Other illustrative systems are shown in FIGS. 3 and 4. These systems
operate in a similar fashion to that shown in FIG. 2 and, as such,
will not be described herein in the interests of brevity except as
noted below. For example, it should be noted that in DGMM 200, of
FIG. 3, VMS 210 is separate from, and separately coupled to, PBX 215.
(It should also be noted that PBX 215 could communicate to VMS 210
via a LAN connection (not shown) to LAN 201.) With respect to FIG. 4,
it is assumed that a user associated with telephone 305, which is
coupled to PBX 315, performs the above-described retrieving and
labeling of voice mail messages via the touch-tone keypad of
telephone 305. In this context, identification and confidence levels
for each newly received voice mail are generated via audio signals
from VMS 310. For example, in addition to announcing the time and
date of a newly received voice mail, VMS 310 announces the
identification (if possible) and the associated confidence level.

FIG. 5 shows an illustrative voice mail server (VMS) 400 embodying
the principles of the invention. VMS 400 comprises interface unit
405, which couples VMS 400 to, e.g., LAN 101. Processor 410 is a
stored-program control processor and executes a DGMM program stored
in memory 420. Storage element 415 provides storage for sound clips,
label data, etc. Line 401 represents a suitable interface element for
coupling to a PBX (not shown). This coupling can either be internal,
if VMS 400 is a part of the PBX (as shown in FIG. 2), or external, if
VMS 400 uses a separate connection to the PBX (as shown in FIG. 3).
(If the PBX communicates with VMS 400 via a LAN, then, obviously,
line 401 is not necessary.)

It should be noted that although the inventive concept is 
illustratively shown in the context of a majority of the 
processing taking place in a voice mail server, the functional 
partitioning between a client and a server can be in any 
proportion. For example, training could be performed on a 
server, while the client performs testing (either using the 
data stored on the server, or periodically updating (from the 
server) a respective data base in the client), or training could 
also be performed on the client. 

Other variations are possible. For example, thresholds 
used during the testing process can be varied as a function 
of detected system differences. One such example is varying 
the thresholds used to test the sound clip as a function of the 
telephone used to convey the voice mail from the calling 
party. In this example, the thresholds used if the telephone 
is coupled to the PBX can be lower than if a call from an 
outside line is detected. Phone identification can be per- 
formed using caller identification, e.g., by incorporating 
such information in priors. 
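
A minimal Python sketch of this variation follows. The function name `select_threshold`, the set of internal extensions, and both numeric values are hypothetical; the text states only that the threshold for a telephone coupled to the PBX can be lower than for an outside line. The sketch picks a threshold directly from caller identification rather than folding that information into priors, which is the alternative the text mentions.

```python
# Hypothetical values chosen only so that internal < external.
INTERNAL_THRESHOLD = 0.5
EXTERNAL_THRESHOLD = 0.7


def select_threshold(caller_id, internal_extensions):
    """Return the test threshold for a call, based on caller identification.

    caller_id           -- caller identification string, or None if unavailable
    internal_extensions -- assumed set of identifiers for PBX-coupled phones
    """
    if caller_id is not None and caller_id in internal_extensions:
        return INTERNAL_THRESHOLD
    # Unknown or outside callers get the stricter threshold.
    return EXTERNAL_THRESHOLD
```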

The foregoing merely illustrates the principles of the 
invention and it will thus be appreciated that those skilled in 
the art will be able to devise numerous alternative arrange- 
ments which, although not explicitly described herein, 
embody the principles of the invention and are within its 
spirit and scope. 

What is claimed: 

1. In a method for performing speaker verification,
wherein the identified speaker is determined to belong to a
speaker set comprising N target speakers, where N>1, the
improvement comprising:

adding discrimination to a Gaussian mixture model (GMM);
wherein said discrimination is based at least upon
characteristics of all of said speakers; and
wherein a single GMM is used for the speaker set.

2. The method of claim 1 further comprising the step of 
using binary classifiers for selecting the identified speaker. 

3. The method of claim 2 wherein the binary classifiers are 
support vector machines. 

4. The method of claim 1 further comprising the steps of: 
training the discriminative GMM on speech; 
generating support vector machine training data by clus- 
tering data resulting from the GMM training step; and 

using at least one support vector machine for selecting the 
identified speaker. 

5. The method of claim 4 wherein the speech used for 
GMM training includes speech from at least one target 
speaker. 

6. The method of claim 4, wherein the using step uses a
support vector machine to process a speech sample if other
processing of the speech sample relative to a predefined
threshold indicates that the speech sample exceeds the
predefined threshold.

7. The method of claim 6, wherein the predefined threshold
is a measure of a likelihood that the speech sample came
from a GMM cluster.

8. The method of claim 1 further comprising the steps of: 
training the GMM on speech; 

clustering data resulting from the training step; 
associating N support vector machines for each cluster; 
and 

using at least one of the support vector machines for 
selecting the identified speaker. 

9. The method of claim 8, wherein the speech used for 
GMM training includes speech from at least one target 
speaker. 

10. The method of claim 8, wherein the using step uses a 
support vector machine to process a speech sample if other 
processing of the speech sample relative to a predefined 
threshold indicates that the speech sample exceeds the 
predefined threshold. 

11. The method of claim 10, wherein the predefined 
threshold is a measure of a likelihood that the speech sample 
came from a cluster of GMM data. 

12. A method for use in a voice messaging system for
verifying a speaker, the method comprising the steps of:

training a Gaussian mixture model (GMM) on speech,
wherein the GMM is used for N speakers, where N>1
and wherein said training is based at least upon
characteristics of all of said N speakers;

generating support vector machine training data by
clustering data resulting from the training step; and

using at least one support vector machine to determine
that the speaker belongs to the set of said N speakers.

13. The method of claim 12 wherein the speech used for 
GMM training includes speech from at least one of the N 
speakers. 

14. The method of claim 12, wherein the using step uses 
a support vector machine to process a speech sample if other 
processing of the speech sample relative to a predefined 
threshold indicates that the speech sample exceeds the 
predefined threshold. 

15. The method of claim 14, wherein the predefined 
threshold is a measure of a likelihood that the speech sample 
came from a cluster of GMM data. 

16. The method of claim 12, wherein the voice messaging 
system comprises a server and at least one client and the 
steps of the method of claim 12 are performed in either one. 

17. The method of claim 12, wherein the voice messaging 
system comprises a server and at least one client and the 
steps of the method of claim 12 are distributed between 
them. 

18. A voice messaging system comprising:
a client;

a server for (a) storing a voice message, (b) verifying a
speaker of the voice message as being from a set of N
target speakers, where N>1, and (c) providing speaker
verification information to the client;
wherein the server uses a discriminative Gaussian mixture
model (GMM) for verifying the speaker, and wherein
the discriminative GMM is based at least upon
characteristics of all of the N target speakers; and
wherein the speaker is identified as being one of said N
speakers.

19. The system of claim 18 wherein the speaker
verification information is further incorporated into a
confidence level.

20. Voice messaging apparatus comprising:
means for receiving a voice message; and
a processor that verifies a speaker of the voice message as
belonging to a set of N possible speakers, where N>1;

wherein the processor uses a discriminative Gaussian
Mixture Model (GMM) for verifying the speaker, and
where the same discriminative GMM is used for all the
N possible speakers and wherein the discriminative
GMM is based at least upon characteristics of all N
possible speakers.

21. An article of manufacture for use in performing speaker
verification comprising a machine readable medium
comprising one or more programs which when executed
implement the step of:

using a discriminative Gaussian mixture model (DGMM)
for use in associating a message with an unspecified
speaker from a set of N possible speakers, where N>1,
such that the DGMM uses one GMM for the N possible
speakers;

wherein said DGMM is based at least upon characteristics
of all of the N possible speakers.

22. The program of claim 21 further comprising the step
of using a binary classifier for selecting the unspecified
speaker.

23. The program of claim 22 wherein the binary classifier
is a support vector machine.

24. The program of claim 21 further comprising the steps
of:

training the DGMM on speech;

generating support vector machine training data by
clustering data resulting from the training step; and
using at least one support vector machine for selecting the
associated speaker.

25. The program of claim 24 wherein the speech used for
DGMM training includes speech from the N possible
speakers.

26. The program of claim 24, wherein the using step uses
one or more support vector machines to process a speech
sample if other processing of the speech sample relative to
a predefined threshold indicates that the speech sample
exceeds the predefined threshold.

27. The program of claim 26, wherein the predefined
threshold is a measure of a likelihood that the speech sample
came from one of the clusters.
